Processor for making more efficient use of idling components and program conversion apparatus for the same

ABSTRACT

A processor that has a plurality of instruction slots each of which stores an instruction to be executed in parallel. One of the plurality of instruction slots is a first instruction slot and another a second instruction slot. A special instruction stored in the first instruction slot is executed by a first functional unit that executes instructions stored in the first instruction slot, and a second functional unit that executes instructions stored in the second instruction slot. An instruction stored in the second instruction slot is executed in parallel by a third functional unit that executes instructions stored in the second instruction slot.

This application is based on application No. 10-083369 filed in Japan, the content of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a processor that executes a plurality of instructions in parallel and to a program conversion apparatus for the same.

2. Description of the Related Art

In recent years, VLIW (Very Long Instruction Word) processors have been developed with the aim of achieving high-speed processing. These processors use long-word instructions composed of a plurality of instructions to execute a number of instructions in parallel.

Japanese Laid-Open Patent No. 5-11979 discloses an example of this kind of technique. FIG. 1 is a block diagram of a processor disclosed in this document.

The processor of FIG. 1 includes a register file 1, an external memory 2, an instruction register 3 having four instruction slots, an input switching circuit 4, a transfer unit 5, an integer calculation unit 6, a transfer unit 7, an integer calculation unit 8, an integer calculation unit 9, a floating-point unit 10, a branch unit 11, an output switching circuit 12 and a register file or external memory 13.

The instruction register 3 stores four instructions, which make up one long-word instruction, in its four internal instruction slots (hereafter referred to as ‘slots’). Here, the instruction in each of the first and second slots is either an integer calculating instruction or a data transfer instruction (also referred to as a load/store instruction). The instruction in the third slot is a floating-point calculating instruction or an integer calculating instruction and that in the fourth slot is a branch instruction. The arrangement of instructions in one long-word instruction is performed in advance by a compiler.

The transfer unit 5 and the integer calculation unit 6 are aligned with the first slot, and execute the data transfer and integer calculating instructions respectively.

The transfer unit 7 and the integer calculation unit 8 are aligned with the second slot, and execute the data transfer and integer calculating instructions respectively.

The integer calculation unit 9 and the floating-point unit 10 are aligned with the third slot, and execute the integer calculation and floating-point instructions respectively.

The branch unit 11 is aligned with the fourth slot and executes branch instructions.

Here, the transfer units 5 and 7, the integer calculation units 6, 8 and 9, the floating-point unit 10 and the branch unit 11 are generally referred to as functional units.

The input switching circuit 4 inputs source data read from the register file 1 or the external memory 2 into the required functional units.

The output switching circuit 12 outputs the results of calculations by the utilized functional units to the register file or external memory 13.

A processor constructed as above decodes and executes instructions stored in the four slots in parallel. Assume, for example, that an ‘add’ instruction for adding register data is stored in the first slot. The processor inputs two pieces of register data from the register file 1 into the integer calculation unit 6 via the input switching circuit 4. The two pieces of register data are then added by the integer calculation unit 6 and the result stored in the register file 13 via the output switching circuit 12. Instructions in the second, third and fourth slots are also decoded and executed in parallel with this instruction.

However, in this kind of conventional processor certain functional units are left idling when instructions are executed. When an integer calculating instruction is executed in the third slot, for example, the floating-point unit is left idling.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a processor that utilizes idling functional units, thus improving processing performance.

A second object is to provide a processor that executes at a high speed the product-sum operations frequently used in current multimedia processing.

A processor that achieves the above objects includes first and second decoding units, first and second executing units corresponding to the first and second decoding units, and a selecting unit. The first and second decoding units decode instructions and generate results denoting their content. If the first decoding unit decodes a special instruction, it generates first-part and second-part decode results denoting a first-type calculation and a second-type calculation. The executing units execute instructions in parallel according to a decode result from the corresponding decoding unit. If the first decoding unit decodes the special instruction, the selecting unit selects the second-part decode result, and if the first decoding unit decodes an instruction other than the special instruction, the selecting unit selects the decode result from the second decoding unit.

The second executing unit includes a first functional unit, which executes instructions according to the decode result selected by the selecting unit, and a second functional unit, which executes instructions according to the decode result of the second decoding unit. If the special instruction is decoded, the first executing unit performs a first-type calculation, the first functional unit performs a second-type calculation and the second functional unit executes an instruction decoded by the second decoding unit.

Here, the special instruction may include an operation code denoting the first-type calculation and the second-type calculation, and first and second operands. The first executing unit performs the first-type calculation on the first and second operands, and stores a calculation result in the first operand. Meanwhile, the second executing unit performs the second-type calculation on the first and second operands, and stores a calculation result in the second operand.

This structure enables a first-type calculation and a second-type calculation to be executed by the first and second executing units according to a special instruction in one instruction slot. This allows idling functional units to be used, thus increasing processing performance.

Here, the first executing unit may include an adder/subtracter, the first functional unit may be an adder/subtracter and the special instruction may denote addition as the first-type calculation and subtraction as the second-type calculation.

This structure enables an instruction other than the special instruction to be executed in parallel with the addition and subtraction denoted by the special instruction, so that the processing performance of the processor can be further increased.

Here, the second functional unit may be a multiplier and the instruction may be a multiply instruction.

This structure enables addition, subtraction and multiplication to be executed in parallel, so that product-sum calculations extensively used in modern multimedia processing can be executed efficiently.

Furthermore, a program conversion apparatus that achieves the above objects is one that changes a source program to an object program for a target processor executing long-word instructions. This program conversion apparatus includes a retrieving unit, a generating unit and an arranging unit. The retrieving unit retrieves a pair of instructions denoting a first-type calculation of two variables and a second-type calculation of the same two variables from a source program. The generating unit generates a special instruction corresponding to the retrieved pair. This special instruction includes an operation code denoting the first-type calculation and the second-type calculation, and two operands representing the two variables. The arranging unit arranges the generated special instruction into a long-word instruction.

This structure generates an object program composed of a plurality of long-word instructions. Special instructions supported by the target processor are embedded in certain of the plurality of long-word instructions.

Here, the first instruction denotes addition, and the second instruction denotes subtraction. The target processor includes a first instruction execution unit having a first calculation unit, and a second instruction execution unit having a second calculation unit and a multiplication unit. The arranging unit retrieves a multiply instruction that has no dependency on the special instruction generated by the generating unit, and arranges the special instruction and the multiply instruction in one long-word instruction.

This structure enables addition, subtraction and multiplication to be performed in parallel by placing two instructions (a special instruction and a multiply instruction) side by side in one long-word instruction. This makes the apparatus suitable as a compiler for programs that perform product-sum calculations.
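The retrieving step described above may be pictured with the following minimal sketch. It is an illustration only, not the apparatus of the invention: the statement representation (a tuple of destination, operator and two source variables) and the function name `find_adsb_pairs` are assumptions introduced here, and the sketch deliberately ignores data dependencies between statements, which an actual program conversion apparatus would need to check.

```python
def find_adsb_pairs(statements):
    """Pair each addition with a subtraction of the same two variables.

    `statements` is a list of (dest, op, left, right) tuples, a
    hypothetical form chosen for this sketch, with op being '+' or '-'.
    Returns a list of (add_index, sub_index) pairs. Each subtraction is
    paired at most once. Dependencies between statements are NOT checked.
    """
    pairs = []
    used_subs = set()
    for i, (_, op_i, left_i, right_i) in enumerate(statements):
        if op_i != "+":
            continue
        for j, (_, op_j, left_j, right_j) in enumerate(statements):
            if op_j != "-" or j in used_subs:
                continue
            # Same two variables, in either order.
            if {left_i, right_i} == {left_j, right_j}:
                pairs.append((i, j))
                used_subs.add(j)
                break
    return pairs


# Four statements of the kind a four-point butterfly might contain
# (variable names are illustrative).
example = [
    ("b0", "+", "a0", "a3"),
    ("b1", "+", "a1", "a2"),
    ("b2", "-", "a1", "a2"),
    ("b3", "-", "a0", "a3"),
]
pairs = find_adsb_pairs(example)
```

Each returned pair is a candidate for replacement by one special instruction; the generating and arranging steps are outside the scope of this sketch.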

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, advantages and features of the invention will become apparent from the following description thereof taken in conjunction with the accompanying drawings which illustrate a specific embodiment of the invention. In the drawings:

FIG. 1 is a block diagram showing a conventional processor;

FIG. 2 is a block diagram showing a structure for a processor in the present embodiment;

FIG. 3 shows the format of instructions;

FIG. 4 shows the instruction set of the processor;

FIG. 5 is a block diagram showing a structure for a decoder aligned to a first slot;

FIG. 6 is a block diagram showing a structure for a decoder aligned to a second slot;

FIG. 7 shows the content of control signals output from the decoder aligned to the first slot;

FIG. 8 shows the content of control signals output from the decoder aligned to the second slot;

FIG. 9 shows the relationship between two inputs to a selector on the first slot side and an output from the same selector;

FIG. 10 shows the relationship between two inputs to a selector on the second slot side and an output from the same selector;

FIG. 11 shows the operation content of a data transfer unit aligned with the first slot;

FIG. 12 shows the operation content of a calculation unit aligned with the first slot;

FIG. 13 shows the operation content of a calculation unit aligned with the second slot;

FIG. 14 shows the operation content of a multiplication unit aligned with the second slot;

FIG. 15 shows an example source program describing a discrete cosine transform;

FIG. 16 is a table showing the correspondence between registers and variables in an example program;

FIG. 17 shows an example program composed of long-word instructions for use by the processor in the present embodiment;

FIG. 18 shows an example of a program composed of long-word instructions for use by a conventional processor; and

FIG. 19 is a block diagram showing a structure for a program conversion apparatus, which converts a source program into a program (execution code) for use by the processor of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Structure of the Processor

FIG. 2 is a block diagram showing the structure of a processor in the present embodiment. This processor includes an instruction register 101, instruction execution units 102 and 103 (hereafter referred to as ‘execution units’) and register file 112. The execution unit 102 includes a decoder 104, a selector 106, a data transfer unit 108 and a calculation unit 109. Furthermore, the execution unit 103 similarly includes a decoder 105, a selector 107, a calculation unit 110 and a multiplication unit 111.

For ease of explanation, it is assumed that one long-word instruction in the present embodiment is composed of two parallel instructions. The instruction register 101 fetches these instructions from a memory (not shown here) and stores them in first and second instruction slots (hereafter referred to as the ‘first and second slots’). Each slot stores one instruction. The format of these instructions is shown in FIG. 3. Each of the instructions shown in this drawing is composed of a first field representing an operation code, and second and third fields showing register numbers as operands. The long-word instruction has a fixed length. FIG. 3 shows six instructions as representative examples. Of these, an ‘adsb’ instruction is of particular importance to the present invention. The ‘adsb’ instruction instructs one of the execution units 102 and 103 to perform addition and the other subtraction. These executions take place simultaneously. Hereafter, the ‘adsb’ instruction is also referred to as the ‘special instruction’ and other instructions as ‘standard instructions’.

The execution unit 102 decodes and executes an instruction stored in the first slot. On decoding a special instruction, the execution unit 102 performs addition, while instructing execution unit 103 to perform simultaneous subtraction.

Similarly, the execution unit 103 decodes and executes an instruction stored in the second slot. On decoding a special instruction, the execution unit 103 performs addition, while instructing execution unit 102 to perform simultaneous subtraction.

The register file 112 has a plurality of registers.

Instruction Set

FIG. 4 shows the instruction set of the processor. This diagram indicates whether the processing content for each of the six representative instructions can be allocated to the first and second slots.

In FIG. 4, an ‘instruction’ column shows the standard names of instructions.

A ‘mnemonic’ column shows mnemonic notations used in assembly language. These mnemonics are composed of an ‘op’ part, which represents the first field (operation code) and two operand parts, which represent the second and third fields. The operand parts Rn and Rm each represent one register in the register file 112.

A ‘processing content’ column shows the content of an operation represented by the ‘op’ part.

An ‘allocated slot’ column shows whether an instruction can be placed in each of the first and second slots (represented by the columns ‘first’ and ‘second’ in the diagram). For example, a ‘mov’ data transfer instruction can be placed in the first slot, but not in the second slot.

As shown in FIG. 4, a ‘mov Rn, Rm’ instruction is a data transfer instruction for reading data from a register Rn and storing it in a register Rm. This instruction is executed by the data transfer unit 108. An ‘add Rn, Rm’ instruction is an ‘add’ instruction for reading data from registers Rn and Rm, adding the read data and storing the result in register Rm. This instruction is executed by the calculation unit 109 or 110. A ‘sub Rn, Rm’ instruction is a subtract instruction for reading data from registers Rn and Rm, subtracting the data of register Rn from the data of register Rm and storing the result in register Rm. This instruction is executed by the calculation unit 109 or 110.

Here, an ‘adsb Rn, Rm’ instruction is an add-subtract instruction for reading the data from registers Rn and Rm, performing parallel addition and subtraction on the data, and storing the result of the addition in register Rn and that of the subtraction in register Rm. This instruction is executed by the calculation units 109 and 110 operating together, one performing the addition and the other the subtraction.
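The register-level effect of the four instructions described above can be summarized in a short behavioral sketch. This is a software model only, written under the assumption that the register file 112 can be represented as a Python list; the function name `execute` is introduced here for illustration and does not appear in the drawings. The subtraction follows the ‘sub’ definition above (the data of Rn is subtracted from the data of Rm).

```python
def execute(op, rn, rm, regs):
    """Apply one instruction 'op Rn, Rm' to the list `regs` in place.

    Behavioral model only: both operands are read before any register
    is written, as a real register file access would do.
    """
    n, m = regs[rn], regs[rm]
    if op == "mov":
        regs[rm] = n            # Rm <- Rn
    elif op == "add":
        regs[rm] = m + n        # Rm <- Rm + Rn
    elif op == "sub":
        regs[rm] = m - n        # Rm <- Rm - Rn
    elif op == "adsb":
        regs[rn] = n + m        # sum goes to Rn
        regs[rm] = m - n        # difference goes to Rm
    else:
        raise ValueError("unsupported mnemonic: " + op)
    return regs


# 'adsb R2, R1' with R1 = 7 and R2 = 3
regs = [0, 7, 3, 0]
execute("adsb", 2, 1, regs)
```

After the call, R2 holds the sum (10) and R1 holds the difference (4), which is the reversed destination behavior that distinguishes ‘adsb’ from ‘add’.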

Execution Units

The execution units 102 and 103 execute the special instruction as well as various standard instructions.

In the execution unit 102, the decoder 104 decodes an instruction stored in the first slot and outputs a decode result, composed of control signals x1 and y1, for executing the instruction.

Here, if a special instruction is decoded by the decoder 104, the control signals x1 instruct the calculation unit 109 to perform addition. If a standard instruction is decoded, the control signals x1 instruct the data transfer unit 108 to transfer data, or the calculation unit 109 to perform a calculation. Meanwhile, if a special instruction is decoded, the control signals y1 instruct the selector 107 inside the execution unit 103 to select input a2 (control signals y1) and the calculation unit 110 to execute subtraction.

The selector 106 receives the control signals x1 output from the decoder 104 (input a1 in FIG. 2) and the control signals x2 output from the decoder 105 (input b1), and one of the two inputs is selected according to control by the decoder 105, not the decoder 104. Specifically, when the decoder 105 decodes a special instruction, the selector 106 selects input b1 (control signals x2) and when the decoder 105 decodes a standard instruction, the selector 106 selects input a1 (control signals x1).

The data transfer unit 108 transfers data according to the control signals x1 when a data transfer instruction is decoded by the decoder 104.

The calculation unit 109 performs calculation according to the control signals selected by selector 106. That is, if the decoder 105 decodes a special instruction, the calculation unit 109 executes subtraction in accordance with the control signals x2 selected by the selector 106. Meanwhile, if the decoder 105 decodes a standard instruction, the calculation unit 109 performs a calculation in accordance with the control signals x1 selected by the selector 106. Here, if a standard instruction is decoded by the decoder 105 and a special instruction by the decoder 104, addition is executed in accordance with control signals x1.

On the other hand, in execution unit 103, the decoder 105 decodes an instruction stored in the second slot and outputs a decode result, composed of control signals x2 and y2, for executing the instruction.

Here, if the decoder 105 decodes a special instruction, the control signals x2 instruct the selector 106 inside the execution unit 102 to select input b1 (control signals x2) and instruct the calculation unit 109 to execute subtraction. If the decoder 105 decodes a special instruction, the control signals y2 instruct the calculation unit 110 to execute addition. If the decoder 105 decodes a standard instruction, the control signals y2 instruct the multiplication unit 111 to execute multiplication or the calculation unit 110 to perform calculation.

The selector 107 receives control signals y1 (input a2) output from the decoder 104, and control signals y2 (input b2) output from the decoder 105, and selects one of the two inputs according to control by the decoder 104, not the decoder 105. That is, when a special instruction is decoded by decoder 104, the selector 107 chooses input a2 (control signals y1) and when a standard instruction is decoded, the selector 107 selects input b2 (control signals y2).

The calculation unit 110 performs calculation according to the control signals selected by selector 107. That is, if a special instruction is decoded by the decoder 104, the calculation unit 110 executes subtraction in accordance with the control signals y1 selected by the selector 107. Meanwhile, if a calculation instruction is decoded as a standard instruction, the calculation unit 110 performs calculation in accordance with the control signals y2 selected by the selector 107. Here, if a standard instruction is decoded by decoder 104 and a special instruction by the decoder 105, addition is executed in accordance with control signals y2.

If a multiply instruction is decoded by the decoder 105, multiplication unit 111 executes multiplication in accordance with the control signals y2.
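The cross-coupled selection performed by the selectors 106 and 107 can be restated as a small sketch. It models only which operation each of the calculation units 109 and 110 ends up performing for a given pair of slot contents; the function name `route` and the use of `None` for "no calculation" are assumptions made for this illustration. The case of an ‘adsb’ instruction in both slots is not described in the text, so the sketch rejects it rather than guess.

```python
ARITHMETIC = {"add", "sub"}


def route(slot1_op, slot2_op):
    """Return (operation of calculation unit 109, operation of unit 110)."""
    if slot1_op == "adsb" and slot2_op == "adsb":
        # Not covered by the description; treated as invalid in this sketch.
        raise ValueError("'adsb' in both slots is not modeled")

    # Decoder 104: x1 is for its own side, y1 is offered to selector 107.
    x1 = "add" if slot1_op == "adsb" else (slot1_op if slot1_op in ARITHMETIC else None)
    y1 = "sub" if slot1_op == "adsb" else None

    # Decoder 105: y2 is for its own side, x2 is offered to selector 106.
    y2 = "add" if slot2_op == "adsb" else (slot2_op if slot2_op in ARITHMETIC else None)
    x2 = "sub" if slot2_op == "adsb" else None

    unit_109 = x2 if slot2_op == "adsb" else x1   # selector 106, controlled by decoder 105
    unit_110 = y1 if slot1_op == "adsb" else y2   # selector 107, controlled by decoder 104
    return unit_109, unit_110


adsb_with_mul = route("adsb", "mul")   # unit 110 subtracts while unit 111 multiplies
mov_with_adsb = route("mov", "adsb")   # unit 109 subtracts while unit 108 transfers
```

Note that a ‘mul’ or ‘mov’ instruction leaves the calculation unit on its own side free, which is exactly the idle capacity the special instruction in the other slot borrows.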

Decoder 104

FIG. 5 is a block diagram showing the structure of the decoder 104 in FIG. 2. The decoder 104 includes a general decoder unit 1041, a special decoder unit 1042, an operand control unit 1043 and a multiplexer 1044. The control signals x1 described above are composed of the output signals x1_op (control signals corresponding to an op code), x1_r1 (register number) and x1_r2 (register number) shown in the diagram. Similarly, the control signals y1 described above are composed of the output signals y1_op, y1_r1 and y1_r2. The content of each of these signals is shown in FIG. 7.

In FIG. 5, the general decoder unit 1041 receives and decodes the first field of an instruction. If the result is a standard instruction, the general decoder unit 1041 outputs control signals x1_op_1 indicating the operation content of the instruction.

The special decoder unit 1042 receives and decodes the first field of an instruction. If the result is an ‘adsb’ instruction, the special decoder unit 1042 outputs control signals indicating the operation content of the ‘adsb’ instruction and instructs the operand control unit 1043 to supply operands. Here, the control signals indicating the operation content of the ‘adsb’ instruction include ‘add’ control signals x1_op_2 and subtract control signals y1_op.

The multiplexer 1044 receives the control signals x1_op_1 and the control signals x1_op_2. If the special decoder unit 1042 has not decoded an ‘adsb’ instruction, the multiplexer 1044 selects the control signals x1_op_1, but if an ‘adsb’ instruction has been decoded the multiplexer 1044 selects the control signals x1_op_2.

The operand control unit 1043 is composed of control sections 1043a to c, each of which corresponds to one bit in each of the second and third fields. In the present embodiment, the second and third fields are each composed of three bits. If an ‘adsb’ instruction is not decoded by the special decoder unit 1042, the operand control unit 1043 supplies register numbers (x1_r1, x1_r2) specified by the operands to the inside of execution unit 102 only. If an ‘adsb’ instruction is decoded, the operand control unit 1043 supplies register numbers (y1_r1, y1_r2) specified by the operands to the execution unit 103 as well as the execution unit 102.

The operand control unit 1043a is composed of gate sets 1045 and 1046 and AND gates 1047 and 1048. Here, a register number Rn, control signals x1_r1, control signals x1_r2 and the like are each three bits. The operand control units 1043a to c each correspond in order to one bit of the three bits.

If an ‘adsb’ instruction is not decoded by the special decoder unit 1042, the gate sets 1045 and 1046 output a register number Rn indicated by the second field of the instruction as x1_r1, and a register number Rm indicated by the third field of the instruction as x1_r2. If a special instruction is decoded, the gate sets 1045 and 1046 output the register number Rn indicated by the second field of the instruction as x1_r2, and the register number Rm indicated by the third field of the instruction as x1_r1. That is, when a standard instruction is decoded, the gate sets 1045 and 1046 output the second and third fields of the instruction in the usual order (Rn, Rm) as (x1_r1, x1_r2), and when a special instruction is decoded, output the second and third fields of the instruction in the reverse order (Rm, Rn) as (x1_r1, x1_r2). The reason for reversing the order is to make the operand of the second field the destination register of the addition for an ‘adsb’ instruction.

If a special instruction is decoded, the AND gates 1047 and 1048 output a register Rn indicated by the second field as y1_r1, and a register Rm indicated by the third field as y1_r2. These signals y1_r1 and y1_r2, combined with y1_op, cause the execution unit 103 to perform subtraction just as if the subtract instruction ‘sub Rn, Rm’ had been decoded from the second slot and executed.

The operand control units 1043b and c differ from the operand control unit 1043a only in corresponding to different bit positions in the second and third fields, and otherwise have the same structure. These operand control units 1043a to c generate signals x1_r1, x1_r2, y1_r1 and y1_r2, which are each three bits.
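The operand handling of the decoder 104 can be sketched at the level of whole register numbers rather than individual gates. In the illustration below each control signal group is modeled as a tuple (op, r1, r2) in which r2 is the destination, matching the ‘op Rn, Rm’ convention; the function name `decode_slot1` and the use of `None` for inactive y1 signals are assumptions of this sketch, since the text does not state what the AND gates 1047 and 1048 output for a standard instruction.

```python
def decode_slot1(op, rn, rm):
    """Return (x1, y1) for an instruction 'op Rn, Rm' in the first slot.

    x1 goes to execution unit 102; y1 is offered to execution unit 103.
    Each is a tuple (op, r1, r2) with r2 acting as destination register.
    """
    if op == "adsb":
        x1 = ("add", rm, rn)   # fields reversed so the sum is written to Rn
        y1 = ("sub", rn, rm)   # usual order, so the difference is written to Rm
    else:
        x1 = (op, rn, rm)      # standard instruction: fields passed through
        y1 = None              # assumption: nothing meaningful is sent to unit 103
    return x1, y1


special = decode_slot1("adsb", 2, 1)
standard = decode_slot1("add", 2, 1)
```

For ‘adsb R2, R1’ the sketch yields an addition whose destination is R2 and a subtraction whose destination is R1, the same pair of operations that ‘add R1, R2’ and ‘sub R2, R1’ would produce if issued separately.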

Decoder 105

FIG. 6 is a block diagram showing a structure of the decoder 105 in FIG. 2. The content of the output signals x2_op, x2_r1 and x2_r2 is shown in FIG. 8.

The structure of the decoder 105 shown in FIG. 6 is a mirror image of that of the decoder 104 shown in FIG. 5. Both decoders are formed from the same components, and so a description of the decoder 105 is omitted.

Selectors 106 and 107

FIG. 9 shows the relationship between inputs a1 and b1 and output for the selector 106 of FIG. 2. This diagram shows the details of what happens when the decoder 104 decodes each of (1) an ‘add’ instruction, (2) a ‘sub’ instruction, (3) an ‘adsb’ instruction, (4) and (5) ‘mov’ instructions, and (6) and (7) ‘nop’ instructions.

In the case of instructions (1) to (4) the selector 106 selects input a1. If (1) the ‘add’ instruction and (3) the ‘adsb’ instruction are compared, it can be seen that the control signal content x1_op of both is addition, but that the control signal contents x1_r1 and x1_r2 are reversed in the case of the ‘adsb’ instruction. This is because the result of the subtraction from the execution unit 103 is stored in register Rm, so the result of the addition from the execution unit 102 must be stored in register Rn.

In the case of instruction (5) the selector 106 selects the input b1. Here the decoder 104 decodes a ‘mov’ instruction, while the decoder 105 decodes an ‘adsb’ instruction in parallel. The ‘mov’ instruction and the ‘adsb’ instruction are executed in parallel.

In the case of instruction (6), the selector 106 selects the input b1. Here the decoder 104 decodes a ‘nop’ instruction, while the decoder 105 decodes an ‘adsb’ instruction in parallel.

In the case of instruction (7), the selector 106 selects the input a1, but the content of control signals x1_op is no operation.

FIG. 10 shows the relationship between the inputs a2 and b2 and output for the selector 107 in FIG. 2. Here, the details of what happens when the decoder 105 decodes each of (1) an ‘add’ instruction, (2) a ‘sub’ instruction, (3) an ‘adsb’ instruction, (4) a ‘mul’ instruction, (5) a ‘nop’ instruction, (6) a ‘mul’ instruction and (7) a ‘nop’ instruction are shown.

In the case of instructions (1) to (3), (6) and (7), the selector 107 selects the input b2. If (1) the ‘add’ instruction and (3) the ‘adsb’ instruction are compared, it can be seen that the y2_op control signal content of both is addition, but that the control signal contents y2_r1 and y2_r2 are reversed in the case of the ‘adsb’ instruction. This is because the result of the subtraction from the execution unit 102 is stored in register Rm, so the result of the addition from the execution unit 103 must be stored in register Rn.

In the case of instruction (4), the selector 107 selects the input a2. Here the decoder 105 decodes a ‘mul’ instruction, while the decoder 104 decodes an ‘adsb’ instruction in parallel. The ‘adsb’ instruction and the ‘mul’ instruction are executed in parallel.

In the case of instruction (5), the selector 107 selects input a2. Here, the decoder 105 decodes a ‘nop’ instruction, while the decoder 104 decodes an ‘adsb’ instruction in parallel.

Functional Units

FIG. 11 shows the content of operations performed by the data transfer unit 108. If a ‘mov Rn1, Rm1’ instruction stored in the first slot is decoded, the data transfer unit 108 transfers the data in register Rn1 to register Rm1.

FIG. 12 shows the content of operations performed by the calculation unit 109. The diagram shows the operations for (1) a first slot ‘add Rn1, Rm1’ instruction, (2) a first slot ‘sub Rn1, Rm1’ instruction, (3) a first slot ‘adsb Rn1, Rm1’ instruction and (4) a second slot ‘adsb Rn2, Rm2’ instruction.

The content of the control signals s1_op for addition performed by (1) the ‘add’ instruction and (3) the ‘adsb’ instruction is the same. However, the destination register differs according to the instruction. The destination register for (1) the ‘add’ instruction is the third field Rm1 and for (3) the ‘adsb’ instruction the second field Rn1. This is because the control signals s1_r1 and the control signals s1_r2 are switched by the operand control unit 1043 in the case of (3) the ‘adsb’ instruction.

Here, the content of control signals s1_op for subtraction performed by (2) the first slot ‘sub Rn1, Rm1’ instruction and (4) the second slot ‘adsb Rn2, Rm2’ instruction is the same. The destination register for both these instructions is the third field, Rm1 or Rm2 respectively.

FIG. 13 shows the content of operations performed by the calculation unit 110. The calculation unit shown in this diagram is the same as calculation unit 109 of FIG. 12 and so an explanation is not given here.

FIG. 14 shows the content of operations performed by the multiplication unit 111. If a ‘mul Rn2, Rm2’ instruction stored in the second slot is decoded, the multiplication unit 111 calculates the product Rm2*Rn2 and stores the result in register Rm2.

Program

The following is an explanation of the operation of an example program using an ‘adsb’ instruction, which is executed by a processor constructed as described above. It should be noted that in the following explanation the second and third fields of an instruction are each four bits, and the processor has sixteen registers R0 to R15.

FIG. 15 shows an example of a source program describing a 4×4 discrete cosine transform. Here, a[0] to a[3] represent as-yet unconverted data, c[0] to c[3] converted data and f0 to f2 constants. As shown in FIG. 16, each of the values a[0] to a[3], f0, f1−f2, f1+f2 and f2 is stored in advance in the registers R0 to R7.

FIG. 17 shows an example program composed of long-word instructions for the processor of the present embodiment. This program corresponds to the source program of FIG. 15. The following explains each instruction in the program in order.

First Long-Word Instruction

First Slot: ‘adsb R2, R1’

This instruction corresponds to the addition and subtraction shown inthe second and third lines of the program in FIG. 15. Using thisinstruction, the processor performs addition and subtraction in parallelon the values a[1] and a[2] stored in registers R1 and R2. The result ofthe addition b[1] is stored in register R2 and that of the subtractionb[2] in register R1.

Second Slot: ‘nop’

There is no instruction which can be performed simultaneously with theinstruction of the first s1ot, so a no operation instruction isinserted.

Second Long-instruction Word

First Slot: ‘mov R1, R8’

The processor transfers the value b[2] stored in the register R1 to theregister R8.

Second Slot: ‘adsb R3, R0’

This instruction corresponds to the addition and subtraction on thefirst and fourth lines of the program shown in FIG. 15. According tothis instruction, the processor performs parallel addition andsubtraction on the values a[0] and a[3] stored in registers R0 and R3.The resulting values b[0] and b[3] are stored in registers R3 and R0respectively.

Third Long-Word Instruction

First Slot: ‘mov R0, R9’

In response to this instruction, the processor transfers the value b[3]stored in register R0 to register R9.

Second Slot: ‘mul R5, R1’

In response to this instruction, the processor stores the product of thevalue b[2] stored in register R1 and (f1−f2) stored in register R5 inthe register R1.

Fourth Long-Word Instruction

First Slot: ‘add R9, R8’

In response to this instruction, the processor stores the sum of thevalues b[2] stored in register R8 and b[3] stored in the register R9 inregister R8.

Second Slot: ‘mul R6, R0’

In response to this instruction, the processor stores the product of thevalue b[3] stored in register R0 and (f1+f2) stored in register R6 inregister R0.

Fifth Long-Word Instruction

First Slot: ‘adsb R2, R3’

In response to this instruction, the processor stores the sum and thedifference of the values b[0] stored in the register R3 and b[1] storedin the register R2 in the registers R2 and R3 respectively.

Second Slot: ‘mul R7, R8’

In response to this instruction, the processor stores the product of the value (b[2]+b[3]) stored in register R8 and f2 stored in register R7 in register R8.

Sixth Long-Word Instruction

First Slot: ‘add R8, R1’

In response to this instruction, the processor stores the sum of the value (b[2]*(f1−f2)) stored in register R1 and the value ((b[2]+b[3])*f2) stored in register R8, that is the value c[2], in the register R1.

Second Slot: ‘mul R4, R2’

In response to this instruction, the processor stores the product of the value (b[0]+b[1]) stored in register R2 and the value f0 stored in register R4, that is the value c[0], in register R2.

Seventh Long-Word Instruction

First Slot: ‘sub R8, R0’

In response to this instruction, the processor stores the difference between the value (b[3]*(f1+f2)) stored in register R0 and the value ((b[2]+b[3])*f2) stored in register R8, that is the value c[3], in the register R0.

Second Slot: ‘mul R4, R3’

In response to this instruction, the processor stores the product of the value (b[0]−b[1]) stored in the register R3 and the value f0 stored in the register R4, that is the value c[1], in the register R3.

Use of the ‘adsb’ instruction enables processing to take place efficiently, as the program example shown above demonstrates. Here, the processor can execute the ‘adsb’ instruction and the ‘mul’ instruction simultaneously, as in the fifth long-word instruction, so that product-sum calculations can be executed efficiently as shown in this program. In actual image compression processing, a number of product-sum calculations need to be performed for each image block, so that a very large number of product-sum calculations are performed for each frame. Thus, use of the ‘adsb’ instruction can greatly increase the processing rate.
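The seven long-word instructions walked through above can be checked with a small simulation. The following is only a sketch under stated assumptions: the operand conventions are inferred from the walk-through rather than taken from an instruction set specification, the first long-word instruction is assumed to be ‘adsb R2, R1’ paired with ‘nop’ (inferred from where b[1] and b[2] end up), and the reference formulas for b[] and c[] are reconstructed from the register descriptions rather than copied from FIG. 15. The names `execute_slot`, `run`, `PROGRAM`, `transform` and `reference` are hypothetical.

```python
# Sketch of a two-slot long-word simulator. Assumed operand conventions:
#   mov  Rs, Rd : Rd <- Rs
#   add  Rs, Rd : Rd <- Rd + Rs
#   sub  Rs, Rd : Rd <- Rd - Rs
#   mul  Rs, Rd : Rd <- Rd * Rs
#   adsb Rn, Rm : Rn <- Rm + Rn and Rm <- Rm - Rn (both from the old values)

def execute_slot(op, regs):
    """Return the register writes of one slot, reading only the old state."""
    name = op[0]
    if name == "nop":
        return {}
    x, y = op[1], op[2]
    if name == "mov":
        return {y: regs[x]}
    if name == "add":
        return {y: regs[y] + regs[x]}
    if name == "sub":
        return {y: regs[y] - regs[x]}
    if name == "mul":
        return {y: regs[y] * regs[x]}
    if name == "adsb":
        return {x: regs[y] + regs[x], y: regs[y] - regs[x]}
    raise ValueError(f"unknown mnemonic: {name}")

def run(program, regs):
    """Execute each long-word instruction; both slots see the same old state."""
    regs = dict(regs)
    for word in program:
        writes = {}
        for op in word:
            writes.update(execute_slot(op, regs))
        regs.update(writes)
    return regs

PROGRAM = [
    (("adsb", "R2", "R1"), ("nop",)),              # assumed first word
    (("mov", "R1", "R8"), ("adsb", "R3", "R0")),
    (("mov", "R0", "R9"), ("mul", "R5", "R1")),
    (("add", "R9", "R8"), ("mul", "R6", "R0")),
    (("adsb", "R2", "R3"), ("mul", "R7", "R8")),
    (("add", "R8", "R1"), ("mul", "R4", "R2")),
    (("sub", "R8", "R0"), ("mul", "R4", "R3")),
]

def transform(a, f0, f1, f2):
    """Run PROGRAM with a[0..3] in R0..R3 and constants in R4..R7."""
    regs = {"R0": a[0], "R1": a[1], "R2": a[2], "R3": a[3],
            "R4": f0, "R5": f1 - f2, "R6": f1 + f2, "R7": f2,
            "R8": 0, "R9": 0}
    out = run(PROGRAM, regs)
    return [out["R2"], out["R3"], out["R1"], out["R0"]]  # c[0], c[1], c[2], c[3]

def reference(a, f0, f1, f2):
    """Direct evaluation of the formulas implied by the walk-through."""
    b = [a[0] + a[3], a[1] + a[2], a[1] - a[2], a[0] - a[3]]
    return [(b[0] + b[1]) * f0,
            (b[0] - b[1]) * f0,
            b[2] * (f1 - f2) + (b[2] + b[3]) * f2,
            b[3] * (f1 + f2) - (b[2] + b[3]) * f2]
```

For example, `transform([1, 2, 3, 4], 2, 5, 3)` agrees with `reference([1, 2, 3, 4], 2, 5, 3)`. Each slot's writes are collected before any register is updated, which models both slots of a long-word instruction reading their operands before either result is written back.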

FIG. 18 shows a program used by a conventional processor, having two instruction slots, which does not use the ‘adsb’ instruction. This program sequence also corresponds to the source program in FIG. 15. From this it can be seen that a conventional processor needs ten long-word instructions to execute the program, while the processor in the present invention requires only seven.

Here, the add-subtract instruction can be placed in either the first or second slot, but a construction in which an add-subtract instruction can be placed in only one of the two slots may alternatively be used. For example, the processor shown in FIG. 2 can be constructed without the selector 107. In this case, an ‘adsb’ instruction can only be placed in the first slot.

While each register in the above explanation stores one piece of data, each register may be divided, for example, into an upper and a lower field. These fields store two pieces of data sequentially, with each taking up half of the register width. This is known as SIMD (Single Instruction Multiple Data) format. In this case, add instructions, subtract instructions, add-subtract instructions and multiply instructions may be executed by performing the required calculation on values stored in either the upper or the lower fields of two registers. The result of the calculation is stored in the original field in one of the registers. For an ‘adsb’ instruction, the content of the two registers can be switched, as shown in the present embodiment. Registers may of course be divided into three or more fields using SIMD format.
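As an illustration of the SIMD format described above, the following sketch packs two values into one register and applies an add-subtract operation field by field. The 32-bit register width, the 16-bit field width, the wraparound on overflow, and the choice to process both fields in a single operation are all assumptions made for the example; the helper names `split`, `pack` and `simd_adsb` are hypothetical.

```python
FIELD_BITS = 16                      # assumed field width
FIELD_MASK = (1 << FIELD_BITS) - 1   # 0xFFFF

def split(reg):
    """Split a 32-bit register value into its (upper, lower) 16-bit fields."""
    return (reg >> FIELD_BITS) & FIELD_MASK, reg & FIELD_MASK

def pack(upper, lower):
    """Pack two values into one register, truncating each to 16 bits."""
    return ((upper & FIELD_MASK) << FIELD_BITS) | (lower & FIELD_MASK)

def simd_adsb(rn, rm):
    """Field-wise add-subtract: new Rn = Rm + Rn, new Rm = Rm - Rn."""
    n_hi, n_lo = split(rn)
    m_hi, m_lo = split(rm)
    new_rn = pack(m_hi + n_hi, m_lo + n_lo)
    new_rm = pack(m_hi - n_hi, m_lo - n_lo)
    return new_rn, new_rm
```

Each result is written back to the same field position it was computed from, so the upper fields never affect the lower fields and vice versa.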

Furthermore, the processor in the present embodiment is a VLIW processor, but a superscalar processor may also be used. In this case, the processor includes a retrieving unit, which retrieves two instructions that can be executed simultaneously from a serial instruction sequence. The two retrieved instructions are stored in the first and second slots and executed by execution units 102 and 103.

The number of instructions executed in parallel in the present embodiment is two, but it may alternatively be three or more.

Program Conversion Apparatus

FIG. 19 is a block diagram showing the structure of a program conversion apparatus, which converts a source program into a program (execution codes) for the processor shown in FIG. 2. This program conversion apparatus is realized by executing software describing each of the functions shown in FIG. 19 on hardware such as a conventional workstation or personal computer.

The program conversion apparatus shown in FIG. 19 includes a compiler 201 and a link editing unit 214. The compiler 201 has a compiler upstream unit 210, an assembly code generating unit 211, an instruction scheduling unit 212 and an object code generating unit 213. The compiler 201 converts a source program 200 stored on a hard disk into an object program 220.

The compiler upstream unit 210 reads the source program 200 from the hard disk and performs syntactic and semantic analysis on the read source program. The compiler upstream unit 210 then generates an intermediate program composed of internal format codes (hereafter referred to as ‘intermediate codes’) from the results of this analysis.

The assembly code generating unit 211, having a retrieving unit 211 a, generates an assembly program composed of assembly codes (instructions written in mnemonic format) from the intermediate program generated by the compiler upstream unit 210.

In order to generate an assembly program, the retrieving unit 211 a retrieves an intermediate code indicating an addition of two variables and an intermediate code indicating a subtraction of the same two variables from the intermediate program. The assembly code generating unit 211 generates an ‘adsb Rn, Rm’ instruction for the pair of intermediate codes retrieved by the retrieving unit 211 a.

For convenience's sake, the source program shown in FIG. 15 is treated as an intermediate program. First, the retrieving unit 211 a retrieves the variables of an intermediate code denoting addition (for example the intermediate code on the first line) from the intermediate program. Then, by retrieving an intermediate code which performs subtraction using the same variables (the intermediate code of the fourth line), the retrieving unit 211 a retrieves a pair of intermediate codes, i.e. those of the first and fourth lines. The retrieving unit 211 a performs the above processing for each intermediate code denoting addition. As a result, in FIG. 15 three pairs, the first and fourth lines, the second and third lines and the seventh and eighth lines, are retrieved. The assembly code generating unit 211 generates an ‘adsb’ instruction for each pair.
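The pairing performed by the retrieving unit 211 a can be sketched as follows. This is a simplified illustration, not the apparatus's actual algorithm: the `Code` record layout and the function name `find_add_sub_pairs` are hypothetical, operands are required to appear in the same order in both codes, and the sketch does not check whether an operand is redefined between the two codes, which a real implementation would need to do.

```python
from typing import NamedTuple

class Code(NamedTuple):
    """One intermediate code of the assumed form: dest = lhs op rhs."""
    dest: str
    op: str
    lhs: str
    rhs: str

def find_add_sub_pairs(codes):
    """Return (add_index, sub_index) pairs that use the same two operands."""
    pairs = []
    used_subs = set()
    for i, add in enumerate(codes):
        if add.op != "+":
            continue
        for j, sub in enumerate(codes):
            if j in used_subs or sub.op != "-":
                continue
            if (sub.lhs, sub.rhs) == (add.lhs, add.rhs):
                pairs.append((i, j))
                used_subs.add(j)
                break
    return pairs

# Illustrative intermediate program modeled on the first four lines
# described in the text (the exact content of FIG. 15 is assumed).
EXAMPLE = [
    Code("b[0]", "+", "a[0]", "a[3]"),
    Code("b[1]", "+", "a[1]", "a[2]"),
    Code("b[2]", "-", "a[1]", "a[2]"),
    Code("b[3]", "-", "a[0]", "a[3]"),
]
```

With zero-based indices, `find_add_sub_pairs(EXAMPLE)` pairs the first code with the fourth and the second with the third, and each pair would then be emitted as one ‘adsb’ instruction.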

The instruction scheduling unit 212, having a dependency analysis unit 212 a and an instruction allocation unit 212 b, arranges the assembly codes within the assembly program in parallel according to the specification of the target processor. In the present embodiment, the processor of FIG. 2 is the target, so the instruction scheduling unit 212 arranges two instructions in parallel. Here, if two instructions that can be executed in parallel are not available, the instruction scheduling unit 212 inserts a ‘nop’ instruction.

The dependency analysis unit 212 a analyzes the dependency of instructions in the assembly program generated by the assembly code generating unit 211. Here, instruction dependency is divided into three kinds: data dependency, reverse dependency and output dependency. Data dependency is the dependency of an instruction referring to a certain resource (register or memory) on an instruction defining the same resource. Reverse dependency is the dependency of an instruction that defines a certain resource on an instruction that refers to the same resource. Output dependency is the dependency of an instruction that defines a certain resource on another instruction that also defines that resource. If the execution order of a pair of dependent instructions is switched, an error will occur in the program, so it is vital to preserve the original execution order of such instructions.
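The three kinds of dependency can be expressed as set intersections over the registers each instruction defines and refers to. The sketch below assumes instructions are given as `(mnemonic, first_operand, second_operand)` tuples and that the operand roles follow the conventions inferred from the program example (‘mov’ reads its first operand and writes its second, ‘add’, ‘sub’ and ‘mul’ read both and write the second, ‘adsb’ reads and writes both). The function names `defs_uses` and `dependencies` are hypothetical, and memory resources are not modeled.

```python
def defs_uses(ins):
    """Return (defined_registers, referenced_registers) for one instruction."""
    name, first, second = ins
    if name == "mov":
        return {second}, {first}
    if name == "adsb":
        return {first, second}, {first, second}
    return {second}, {first, second}   # add, sub, mul

def dependencies(earlier, later):
    """Classify how `later` depends on `earlier` in program order."""
    e_defs, e_uses = defs_uses(earlier)
    l_defs, l_uses = defs_uses(later)
    kinds = set()
    if e_defs & l_uses:
        kinds.add("data")      # later refers to what earlier defines
    if e_uses & l_defs:
        kinds.add("reverse")   # later defines what earlier refers to
    if e_defs & l_defs:
        kinds.add("output")    # both define the same resource
    return kinds
```

For instance, ‘mov R1, R8’ followed by ‘add R9, R8’ yields both a data and an output dependency on R8, whereas ‘mov R1, R8’ and ‘adsb R3, R0’ share no registers and so may be placed in the same long-word instruction.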

The instruction allocation unit 212 b, following the result of analysis by the dependency analysis unit 212 a, arranges two non-dependent instructions in parallel as a long-word instruction. In doing so, the instruction allocation unit 212 b retrieves a non-dependent multiply (‘mul’) or transfer (‘mov’) instruction for each ‘adsb’ instruction in the assembly program. On retrieving a multiply instruction, the instruction allocation unit 212 b assigns the ‘adsb’ instruction to the first slot and the ‘mul’ instruction to the second slot in parallel. On retrieving a transfer instruction, the instruction allocation unit 212 b assigns the transfer instruction to the first slot and the ‘adsb’ instruction to the second slot in parallel. If a ‘mul’ instruction or ‘mov’ instruction which is not dependent on an ‘adsb’ instruction does not exist, the instruction allocation unit 212 b places a ‘nop’ instruction and an ‘adsb’ instruction in parallel.
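The slot assignment rule for a single ‘adsb’ instruction can be sketched as below. Several points are assumptions rather than statements from the text: ‘mul’ candidates are tried before ‘mov’ candidates, the ‘nop’ is placed in the second slot (following the first long-word instruction of the program example), and the caller is assumed to supply only candidates whose ordering constraints against the rest of the program are already satisfied. The names `regs_of`, `independent`, `NOP` and `allocate` are hypothetical.

```python
NOP = ("nop", None, None)

def regs_of(ins):
    """Return (defined_registers, referenced_registers) for one instruction."""
    name, first, second = ins
    uses = {first} if name == "mov" else {first, second}
    defs = {first, second} if name == "adsb" else {second}
    return defs, uses

def independent(x, y):
    """True if x and y have no data, reverse or output dependency."""
    x_defs, x_uses = regs_of(x)
    y_defs, y_uses = regs_of(y)
    return not (x_defs & y_uses or x_uses & y_defs or x_defs & y_defs)

def allocate(adsb, candidates):
    """Return (first_slot, second_slot) for one 'adsb' instruction."""
    for wanted in ("mul", "mov"):          # assumed preference order
        for cand in candidates:
            if cand[0] == wanted and independent(adsb, cand):
                # 'mul' goes to the second slot, 'mov' to the first slot.
                return (adsb, cand) if wanted == "mul" else (cand, adsb)
    return (adsb, NOP)                     # no partner: pad with 'nop'
```

For example, `allocate(("adsb", "R2", "R3"), [("mul", "R7", "R8")])` reproduces the fifth long-word instruction of the program example, and `allocate(("adsb", "R3", "R0"), [("mov", "R1", "R8")])` reproduces the second.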

The object code generating unit 213 generates the object program 220, which is composed of machine language instruction codes, from the assembly program arranged in parallel by the instruction scheduling unit 212. That is, each assembly code in the assembly program that has been placed in parallel is converted into a machine language instruction code.

The link editing unit 214 generates an executable program 230 by joining the object program generated by the object code generating unit 213 with another object program. The program sequence of long-word instructions shown in FIG. 17 is an example of an executable program. It should be noted, however, that this drawing uses mnemonic notation.

The program conversion apparatus in the above embodiment converts an add instruction and a subtract instruction for the same two operands into one ‘adsb’ instruction. Furthermore, ‘adsb’ instructions are arranged in parallel with ‘mov’ or ‘mul’ instructions. As a result, the program conversion apparatus can generate long-word instruction sequences suitable for a processor like the one in FIG. 2.

Here, in the above program conversion apparatus, the retrieving unit 211 a retrieves pairs of intermediate codes from the intermediate program, each pair including intermediate codes for an addition and a subtraction. However, as an alternative, a pair of source codes indicating an addition and a subtraction may be retrieved from the source program. In this case, a construction in which the compiler upstream unit 210 generates intermediate codes, indicating addition and subtraction, from the retrieved pair of source codes is used.

As a further alternative, the retrieving unit 211 a may retrieve an add and subtract instruction pair from the object program. In this case, a construction in which the retrieved pair is replaced with an ‘adsb’ instruction by the assembly code generating unit 211 or the instruction scheduling unit 212 is used.

It should be noted that the target processor may also be a modified version of the one in FIG. 2. For example, if a construction in which an ‘adsb’ instruction can only be placed in one of the slots, or in which three or more instructions are arranged in parallel is used, instructions may be suitably arranged in parallel by the instruction allocation unit 212 b.

Although the present invention has been fully described by way of examples with reference to accompanying drawings, it is to be noted that various changes and modifications will be apparent to those skilled in the art. Therefore, unless such changes and modifications depart from the scope of the present invention, they should be construed as being included therein.

1-20. (canceled)
21. A processor for executing a plurality of instructions in parallel, comprising: a first functional unit, a second functional unit, and a third functional unit each of which is configured to execute an instruction, wherein the processor is configured to execute a plurality of a first type of instructions in parallel, and a plurality of instructions including a second type of instructions in parallel such that: when a plurality of a first type of instructions are executed in parallel the first functional unit is capable of executing an instruction in parallel with an execution of another instruction by the second functional unit; and when a plurality of instructions including a second type of instruction are executed in parallel, the third functional unit is capable of executing an instruction, which is different from the second type of instruction, in parallel with an execution of the second type of instruction by the first functional unit and the second functional unit.
22. The processor of claim 21, wherein the first type of instructions are any types of instruction other than the second type of instruction.
23. The processor of claim 22, further comprising: a plurality of execution groups, the first, second, and third functional units being located in the execution groups; wherein the first functional unit and the third functional unit are in different execution groups, and wherein, when a plurality of the first type of instructions are executed in parallel, the third functional unit is capable of executing an instruction in parallel with an execution of another instruction by the first functional unit.
24. The processor of claim 23, wherein the first functional unit and the second functional unit are in different execution groups.
25. The processor of claim 24, wherein, when a plurality of the first type of instructions are executed in parallel, functional units in different execution groups are capable of executing the instructions in parallel.
26. A processing method for executing a plurality of instructions in parallel using a first functional unit, a second functional unit, and a third functional unit, each of the functional units being configured to execute an instruction, the method comprising: executing a plurality of a first type of instructions in parallel, and a plurality of instructions including a second type of instruction in parallel, when a plurality of a first type of instructions are executed in parallel, the first functional unit executes an instruction in parallel with an execution of another instruction by the second functional unit; and when a plurality of instructions including a second type of instruction are executed in parallel, the third functional unit executes an instruction, which is different from the second type of instruction, in parallel with an execution of the second type of instruction by the first functional unit and the second functional unit.
27. The method of claim 26, wherein the first type of instructions are any types of instruction other than the second type of instruction.
28. The method of claim 27, wherein a plurality of execution groups are provided in which the first, second, and third functional units are located, wherein the first functional unit and the third functional unit are in different execution groups, and wherein, when a plurality of the first type of instructions are executed in parallel, the third functional unit executes an instruction in parallel with an execution of another instruction by the first functional unit.
29. The method of claim 28, wherein the first functional unit and the second functional unit are in different execution groups.
30. The method of claim 29 wherein, when a plurality of the first type of instructions are executed in parallel functional units in different execution groups execute the instructions in parallel.
31. The processor of claim 23 wherein the first type of instructions are standard instructions and the second type instruction is a special instruction.
32. The processor of claim 31, wherein each of the standard instructions requires one functional unit for execution and the special instruction requires two functional units for execution.
33. The method of claim 28, wherein the first type of instructions are standard instructions and the second type of instruction is a special instruction.
34. The method of claim 33, wherein each of the standard instructions requires one functional unit for execution and the special instruction requires two functional units for execution.